Comparison of PAM50 classification and proteome profiles in breast cancer tissue

Introduction

PAM50 classification

Uses the expression of 50 representative genes, measured by microarray or RNA-seq, to classify breast cancer tissue samples into four categories:

  • HER2-enriched
  • Luminal A
  • Luminal B
  • Basal-like

Our data

iTRAQ-based mass spectrometry proteome profiling data from 77 breast cancer tissues [1]

Tumor receptor status for:

  • Oestrogen (ER)
  • Progesterone (PR)
  • Human epidermal growth factor receptor 2 (HER2)

Methods and strategy

  • Data obtained from a Kaggle dataset
  • The dataset contains 77 breast cancer samples with ~12,000 proteins per sample
  • Differential expression analysis using a linear model
  • Analysis of differentially expressed genes using volcano plots, heatmaps, hierarchical clustering, PCA and k-means clustering, and correlations
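
As an illustration of the volcano plot step listed above, the following is a minimal sketch of how per-protein results could be turned into plot coordinates. The function name `volcano_points`, the protein labels, and both cutoffs (p < 0.05 and an absolute effect of at least 1.0) are assumptions made for this example and are not taken from the analysis itself. The effect size could be, for instance, a log2 fold change or a model coefficient.

```python
import math


def volcano_points(results, p_cutoff=0.05, effect_cutoff=1.0):
    """Convert per-protein (effect, p_value) pairs into volcano plot rows.

    results: dict mapping a protein label to (effect, p_value).
    Returns a list of (protein, effect, -log10(p_value), is_significant).
    The cutoffs are illustrative defaults, not values from the study.
    """
    rows = []
    for protein, (effect, p_value) in results.items():
        if not 0 < p_value <= 1:
            raise ValueError(f"p-value for {protein} must be in (0, 1]")
        neg_log_p = -math.log10(p_value)  # y-axis of the volcano plot
        # A point is flagged only if it passes both the p-value and effect cutoffs.
        is_significant = p_value < p_cutoff and abs(effect) >= effect_cutoff
        rows.append((protein, effect, neg_log_p, is_significant))
    return rows


# Hypothetical results for three made-up proteins.
example = {
    "PROT_A": (2.0, 0.001),   # large effect, small p-value
    "PROT_B": (0.2, 0.001),   # small p-value but negligible effect
    "PROT_C": (-1.5, 0.5),    # large effect but not significant
}
for row in volcano_points(example):
    print(row)
```

Plotting the second element against the third for every protein gives the volcano plot; the flag marks the points that would typically be highlighted.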

Results

Clustering based on entire data set

Differential gene expression analysis

Differential gene expression analysis using linear models: \(lm(PAM50 \sim protein)\)
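
To make the per-protein model concrete, here is a minimal sketch of one such fit written in plain Python rather than with R's `lm`. It assumes the PAM50 label has been reduced to a 0/1 indicator for a single subtype (one versus the rest), which is a simplification chosen for this example and not necessarily how the analysis coded the response. The function name `fit_simple_lm` and the toy values are hypothetical.

```python
import math


def fit_simple_lm(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, t_statistic), where the t statistic
    tests the null hypothesis slope == 0 with n - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has zero variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and standard error of the slope.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    if se_slope == 0:
        # Perfect fit: the statistic is undefined for a flat line, infinite otherwise.
        t_stat = 0.0 if slope == 0 else math.copysign(math.inf, slope)
    else:
        t_stat = slope / se_slope
    return slope, intercept, t_stat


# Toy data: abundance of one hypothetical protein in six samples,
# and an assumed indicator of whether each sample is Basal-like (1) or not (0).
protein = [-1.2, -0.8, -1.0, 0.9, 1.1, 1.0]
is_basal = [0, 0, 0, 1, 1, 1]
slope, intercept, t_stat = fit_simple_lm(protein, is_basal)
print(slope, intercept, t_stat)
```

Repeating this fit for every protein and converting each t statistic to a p-value (using the t distribution with n - 2 degrees of freedom) would give one test per protein; with roughly 12,000 proteins, some form of multiple-testing correction would normally be needed before calling a protein significant.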

Results from differential gene expression analysis

PCA with significant proteins

Hierarchical clustering with significant proteins

Clinical data correlation

Proteomics data correlation

Discussion

  • We were able to see some clustering of samples with the same PAM50 classification when only considering the DE genes.
    • Not surprising - but what about the samples we grouped differently?
    • PAM50 is used for risk stratification, but many samples are misclassified [1].
  • This data set was not large enough to say anything about risk.
    • Combining proteomics with large language models could yield better prognostic tools.
  • In addition, we investigated correlations […]